This is part 3 of a series of tutorials on web scraping in R (see part 1 and part 2). If you are new to web scraping, please see the first two parts before continuing.
In this tutorial, we’ll see how to get information using a public API. Specifically, we are going to see how to get lyrics from the popular site genius.com. We’ll use a dedicated R package {geniusr} to do this.
In the previous tutorials, we saw how to scrape data in a way that essentially mimicked what a human user would do: we went to a url, identified the information we wanted, then we “copied” that information into a dataframe in R. Our program in R was able to read and parse the HTML code in order to automatically extract the data we wanted. This method (also known as DOM parsing) is a common approach to scraping, but it is not the only, or even the most efficient, method. APIs are another very common way to access and acquire data from the web.
What is an API?
Instead of downloading a dataset or scraping a site, APIs allow you to request data directly from a website through what’s called an Application Programming Interface. Many large sites like Reddit, Spotify, and Facebook provide APIs so that data analysts and scientists can access data quickly, reliably, and legally. This last bit is important. Always check if a website has an API before scraping by other means. The following brief explanation is adapted from this post at dataquest.io.
‘API’ is a general term for the place where one computer program interacts with another, or with itself. We will be working with web APIs here, where two different computers — a client and server — interact with each other to request and provide data, respectively. APIs provide a way for us to request clean and curated data from a website. When a website like Facebook sets up an API, they are essentially setting up a computer that waits for data requests.
Once this computer receives a data request, it will do its own processing of the data and send it to the computer that requested it. From our perspective as the requester, we will need to write code in R that creates the request and tells the computer running the API what we want. That computer will then read our code, process the request, and return nicely-formatted data that can be easily parsed by existing R libraries.
Why is this valuable? Contrast the API approach with the “pure” web scraping that we used in the previous tutorials. When a programmer scrapes a web page, they receive the data in a messy chunk of HTML. While we were able to use libraries such as {rvest} to make parsing HTML text easier, we still had to go through multiple steps to identify the page urls and the correct bits of HTML to give us what we wanted. This wasn’t too hard with our toy examples, but it can often be quite complicated.
APIs offer a way to get data that we can immediately use, which can save us a lot of time and frustration. Many big sites have R packages that are specifically dedicated to interfacing with those sites’ APIs. {geniusr} is such a library. Other examples include {RedditExtractoR}, {twitteR}, {Rfacebook}, and {spotifyr}. Otherwise, you can use the {httr} and {jsonlite} packages to work with APIs more generally. These are a bit more advanced, and we will not go into these in this session (but see here and here for an introduction).
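To give a flavour of what these dedicated packages are doing under the hood, here is a rough sketch of a raw request to the Genius search endpoint using {httr}. This is just an illustration, not something we’ll need later: it assumes you already have a Genius access token stored in the GENIUS_API_TOKEN environment variable (we set this up below), and the endpoint and response structure are taken from the public Genius API documentation, so double-check those if the sketch doesn’t work for you.

```r
# The Genius search endpoint; requests are authenticated with a Bearer token
base_url <- "https://api.genius.com/search"
token <- Sys.getenv("GENIUS_API_TOKEN")

# Only attempt the request if {httr} is installed and a token is available
if (requireNamespace("httr", quietly = TRUE) && nzchar(token)) {
  res <- httr::GET(
    base_url,
    query = list(q = "The Smiths"),
    httr::add_headers(Authorization = paste("Bearer", token))
  )
  parsed <- httr::content(res, as = "parsed")
  # Each "hit" is a deeply nested list; pull out the matching song titles
  sapply(parsed$response$hits, function(h) h$result$full_title)
}
```

Packages like {geniusr} wrap exactly this kind of request-and-parse cycle in friendly functions, which is why we’ll use them instead.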
R libraries
Libraries we’ll be using:
library(tidyverse) # for data wrangling
library(tidytext) # for text mining
library(geniusr) # for getting lyrics
library(spotifyr) # for getting info from Spotify
library(tictoc) # for timing
library(usethis) # for editing environment files
## Some plotting libraries
library(ggridges)
library(wordcloud)
library(patchwork)
library(RColorBrewer)
Again, this tutorial assumes you are somewhat familiar with the {tidyverse} suite of functions, particularly {ggplot2} and {dplyr}. If you want to learn more about making the most of the tidy approach to programming in R, a good place to start is the R for Data Science book by Hadley Wickham and Garrett Grolemund.
We’ll also be making extensive use of the {tidytext} package, which you can find more info about in the Text Mining with R book online.
Authenticating and setting the Genius API token
To get lyrics, we’re going to be using the really cool {geniusr} package. You can install it and load it in the usual way, but in order to actually get lyrics, there are a few things that we need to do. Getting set up is a bit complicated, but once we’ve got it working, the API is extremely easy to use.
According to the {geniusr} package’s documentation on GitHub, we need to do the following:
- Create a (free) account with genius.com: https://genius.com/
- Create a Genius API client here: https://genius.com/api-clients/new
- Generate a client access token from your API Clients page: https://genius.com/api-clients
- Set your credentials in the System Environment variable
This may not make much sense yet, so I’ll try to explain quickly.
Step 1: Creating an account
I won’t go over how to create an account. Just follow the guidance on the main site and you should be fine.
Step 2: Creating a Genius API client
The next thing we need to do after creating an account is create a Genius API client. To do this you’ll need to go to this page, which should look something like the picture below.
Blank Genius client page
Now you just need to fill in the information you will use for this client. This can really be anything at all. All you need is a name and a dummy website: fill those in, hit “Save”, and that should work. For example, here is the client ‘jasong1’ that I created:
My Genius client page
Notice that there are two bits of Client information: a Client ID and a Client Secret. Each one should be unique. I recommend that you copy both of these (not the ones above, but the ones on your new client) into a text file and keep that somewhere handy. I have a file client_info.txt which I keep in my project directory. It looks like this (with the proper information filled in of course):
## Genius API client info
client_id = "TeRdss..."
client_secret = "..."
Step 3: Generating a client access token
Once you have a client, all you need to do is create an access token by clicking “Generate Access Token”, and copying that into your text file. So just add a line:
client_access_token = "..."
Step 4: Setting credentials in R
The last step involves setting the System Environment variable that R will use to communicate with Genius. The name of the variable is GENIUS_API_TOKEN, and you can set this by calling the genius_token() function and entering your Genius Client Access Token when prompted. This must be done each time you start R.
There is, however, an easy way to have this token set automatically, instead of entering it every time you start your project up. According to the help file ?genius_token,
“The easiest way to accomplish this is to set it in the ‘.Renviron’ file in your home directory.”
Now, you’ve probably never thought about this .Renviron file, let alone altered it, and it’s not always obvious where this file is located. Fortunately, it’s very easy to edit with the {usethis} package, which provides a few functions for creating and editing .Renviron files (among other things).
- usethis::edit_r_environ() will open your user .Renviron, which is in your home directory
- usethis::edit_r_environ("project") will open the .Renviron file in your “project” directory
Simply enter the following in the .Renviron file (where [[my_token]] is your Genius API token, without brackets).
GENIUS_API_TOKEN=[[my_token]]
Save the file and restart RStudio. The token will now automatically be added every time you start R (if you added it to your home directory) or only this particular project (if you added it to this project directory).
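A quick sanity check that everything worked: after restarting, the token should be visible to the R session. (The variable name here is the one documented by {geniusr}; the Sys.setenv() fallback is just for one-off sessions and shouldn’t go in scripts you share, since tokens should stay out of your code.)

```r
# Should return TRUE once the token is set in .Renviron and R has restarted
nzchar(Sys.getenv("GENIUS_API_TOKEN"))

# If it returns FALSE, you can set the token for the current session only
# (replace "my_token" with your actual access token):
# Sys.setenv(GENIUS_API_TOKEN = "my_token")
```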
Getting artist and song info
Now that we’re all set, we can use the {geniusr} package to very quickly and easily get song lyrics from the genius.com website. We’ll start by looking up lyrics and other information on one of my favourite bands, The Smiths.
The Smiths in 1985. Image Source: https://www.prsformusic.com/m-magazine/features/picture-smiths-salford-lads-club-november-1985/
Besides the fact that they happen to be one of my favourite bands, analyzing data from this band also helps illustrate some of the problems and pitfalls we often run into when doing certain kinds of text mining analyses, such as sentiment analysis. But that’s yet another tutorial…
By all means, feel free to experiment with data from your own favourite artists.
Find the artist ID
The first thing we’ll need to do is find the artist ID that Genius uses. These are unique identifiers for each artist, song, album, etc. stored by Genius. We can use artist and song names in some functions, but searching by IDs is quicker and more reliable. But since we don’t yet know what the ID for this artist is, we need to begin with a name search. We do this with the search_artist() function like so.
search_artist("The Smiths")
Cool. We have a simple little tibble object with the artist_id, artist_name and artist_url. So the artist ID for The Smiths is 16669. Now that we have that, we’ll use it to get all the songs that Genius lists for this band. We’ll use the get_artist_songs_df() function and save the output to smiths_songs.
smiths_songs <- get_artist_songs_df(16669)
smiths_songs
Now we have a dataframe of all the songs by The Smiths listed on Genius. The next step is to get the songs, but first we might want to do a bit of cleaning…
Filter duplicate songs
One thing you’ll probably notice is that we have several different versions of some of the songs, e.g. live recordings, single versions, etc. If we want to analyze the song data (getting word frequencies and so on), we don’t want duplicate songs in our data, as that will skew our counts. It’s possible that the lyrics differ across some of these versions, but we’ll assume for now that they’re all the same. So we’ll need to sort out all these “duplicate” songs so they don’t mess with our counts later on.
There are many ways to do this, and if you have a better way, by all means use it. The way I’ll do it is to create a vector of unique songs, and use that to filter the smiths_songs dataframe. To create this vector, I’ll just use regular expressions to remove the extra information, e.g. the “[Rank]”, “[Peel Sessions]”, and “(tate)” bits, from the column song_name, and then use unique() to get only the unique names.
unique_smiths_songs <- smiths_songs$song_name %>%
str_replace_all("\\(.*\\)", "") %>% # remove anything in between parentheses
str_replace_all("\\[.*\\]", "") %>% # remove anything in between [] brackets
str_trim() %>% # remove extra whitespace
unique()
head(unique_smiths_songs, 15)
[1] "Accept Yourself"
[2] "A Rush and a Push and the Land Is Ours"
[3] "Ask"
[4] "Asleep"
[5] "Back to the Old House"
[6] "Barbarism Begins at Home"
[7] "Bigmouth Strikes Again"
[8] "Cemetry Gates"
[9] "Death at One's Elbow"
[10] "Death of a Disco Dancer"
[11] "Frankly, Mr Shankly"
[12] "Girl Afraid"
[13] "Girlfriend in a Coma"
[14] "Golden Lights"
[15] "Half a Person"
Looks good. Now we filter the smiths_songs dataframe to exclude the duplicates.
smiths_songs_unique <- smiths_songs %>%
dplyr::filter(song_name %in% unique_smiths_songs)
smiths_songs_unique
This is more like what we’re looking for. We have 75 songs to work with now.
Getting lyrics
Getting lyrics is easy with the get_lyrics_id() function, but as the name implies, we need to use the unique ID for the song, not just the song name. {geniusr} doesn’t have a function for getting lyrics based on song names, I assume because there are potentially lots of different songs of the same name, so searching by name could cause problems. Fortunately, the information we need is already in the song_id column in our smiths_songs_unique dataframe. For example, the song “Panic” has the ID 208676. We can use that to get the lyrics like so.
get_lyrics_id(208676)
In the resulting dataframe, each line from the song is represented by one row. All the other information is there as well. Depending on what we want to do, we may need to alter this format, and we will, but first we have to get all the songs.
What we need to do is go through all the song IDs in smiths_songs_unique, get the lyrics for each one, and then combine them into a single dataframe. I’m going to create a custom function for getting lyrics, which handles missing songs a little more gracefully, then I’ll use lapply() to create a list containing the lyrics for each song. There are surely other ways to do this, but this way works fine, so what difference does it make?
This may take a few minutes.
my_get_lyrics <- function(id){
  tryCatch({
    get_lyrics_id(id)
  }, error = function(cond) {
    # Tell me which song it can't find and return an empty dataframe instead
    message("Couldn't find lyrics: ",
            smiths_songs_unique$song_name[smiths_songs_unique$song_id == id])
    tibble()
  })
}
# time the process
tic()
smiths_lyrics_df <- lapply(smiths_songs_unique$song_id, my_get_lyrics)
toc()
# Now bind the list into a single dataframe
smiths_lyrics_df <- bind_rows(smiths_lyrics_df)
We get a dataframe with each row representing a line from a song.
smiths_lyrics_df
We now have all the lyrics for all the songs by The Smiths. That’s all there is to it!
Now we have the lyrics, so let’s take a look at some patterns in the songs.
Organizing and summarizing the data
We’ll use the unnest_tokens() function in the {tidytext} package to split the words in the line column and put them into individual rows. This is a very nice format for working with text. Note that this function also converts all words to lower case.
tidy_smiths_lyrics <- smiths_lyrics_df %>%
unnest_tokens(
word, # name of the output column
line # name of the column to separate into tokens
)
tidy_smiths_lyrics %>%
select(song_name, word)
In addition to single words (unigrams), we can create similar dataframes containing ngrams of different sizes. An ‘ngram’ is simply a sequence of n words, usually two (a bigram) or three (a trigram). We can create a dataframe tidy_smiths_bigrams where we now have all the bigrams in each song.
tidy_smiths_bigrams <- smiths_lyrics_df %>%
unnest_ngrams(
ngram, # name of the output column
line, # name of the column to separate into tokens
2L # the number of words in the sequence
)
tidy_smiths_bigrams %>%
select(song_name, ngram)
We can do the same with trigrams.
tidy_smiths_trigrams <- smiths_lyrics_df %>%
unnest_ngrams(
ngram, # name of the output column
line, # name of the column to separate into tokens
3L # the number of words in the sequence
)
tidy_smiths_trigrams %>%
select(song_name, ngram)
In a repetitive song, we’d expect to find not just a lot of repeated individual words, but a lot of repeated word sequences.
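To make that intuition concrete, here’s a toy illustration in base R (the “lyrics” are made up for the example): in a sequence with lots of repeated word pairs, the proportion of unique bigrams drops well below 1, while a sequence with no repeated pairs scores exactly 1.

```r
# Proportion of unique bigrams in a sequence of word tokens
bigram_type_ratio <- function(words) {
  # pair each word with the one that follows it
  bigrams <- paste(head(words, -1), tail(words, -1))
  length(unique(bigrams)) / length(bigrams)
}

repetitive <- c("please", "please", "please", "let", "me", "let", "me", "let", "me")
varied <- c("sixteen", "clumsy", "and", "shy", "i", "went", "to", "london")

bigram_type_ratio(repetitive) # 0.5: half the bigrams are repeats
bigram_type_ratio(varied)     # 1: every bigram occurs exactly once
```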
We can also create a dataframe smiths_lyrics_summary that summarises information about each song, which we’ll use for quick reference later on. There’s a lot we could do with this, but for now we’ll just calculate a few things: n, the number of word tokens in each song; n_types, the number of unique word types in each song; and ttr, the ratio of types to tokens, calculated as \(n\_types / n \times 100\).
smiths_lyrics_summary <- tidy_smiths_lyrics %>%
group_by(song_name) %>%
summarise(
n = length(word),
n_types = length(unique(word)),
ttr = 100*n_types/n
) %>%
distinct() %>% # remove duplicates
drop_na()
smiths_lyrics_summary
We can do the same with our bigram and trigram dataframes and then combine them to have a single summary dataframe. We create individual summary dataframes for our bigrams and trigrams, and then use inner_join() to add them to our original smiths_lyrics_summary dataframe.
smiths_bigrams_summary <- tidy_smiths_bigrams %>%
group_by(song_name) %>%
summarise(
n_bigrams = length(ngram),
n_bigram_types = length(unique(ngram)),
ttr_bigrams = 100*n_bigram_types/n_bigrams
) %>%
distinct() %>% # remove duplicates
drop_na()
smiths_trigrams_summary <- tidy_smiths_trigrams %>%
group_by(song_name) %>%
summarise(
n_trigrams = length(ngram),
n_trigram_types = length(unique(ngram)),
ttr_trigrams = 100*n_trigram_types/n_trigrams
) %>%
distinct() %>% # remove duplicates
drop_na()
smiths_lyrics_summary <- smiths_lyrics_summary %>%
inner_join(smiths_bigrams_summary) %>%
inner_join(smiths_trigrams_summary)
smiths_lyrics_summary
So we have some useful summary info now. In principle we could add lots more to this, e.g. count different parts of speech, specific words, or whatever information we want.
Numbers of words per song
We can look at the longest and shortest songs.
smiths_lyrics_summary %>%
arrange(desc(n_types)) %>%
mutate(song_name = as.factor(song_name)) %>%
ggplot(aes(fct_reorder(song_name, n, .desc = F), n)) +
geom_col(fill = "#ea80fc", width = .6) +
labs(x = "", y = "Number of word tokens") +
coord_flip() +
ggtitle("All songs by The Smiths") +
theme_minimal()
We can look at the “wordiest” songs, as measured by the number of unique words.
smiths_lyrics_summary %>%
arrange(desc(n_types)) %>%
slice(1:10) %>%
ggplot(aes(reorder(song_name, n_types), n_types)) +
geom_col(fill = "#03dac6", width = .8) +
coord_flip() +
labs(x = "", y = "Number of word types") +
ggtitle("Top 10 songs with the most unique words") +
theme_minimal()
We can also look at the most frequent words across all the songs.
tidy_smiths_lyrics %>%
count(word, sort = T)
Sometimes word clouds can be fun to look at, though they’re not terribly informative.
pal <- brewer.pal(8, "Dark2")
tidy_smiths_lyrics %>%
count(word, sort = T) %>%
with(wordcloud(
word, n, random.order = FALSE,
max.words = 100, colors = pal))
There’s nothing really interesting here, as many of our top words are ones like the, and, and personal pronouns, which would be at the top of the list for just about any corpus. What we really want to know are the patterns among the “content” words, and to get those we need to filter out the high frequency function words and/or other words that aren’t very meaningful for our purposes. These are referred to as stopwords.
Removing stopwords
It’s common in many corpus analysis and text mining tasks to remove function words that (people often assume) do not contribute much to the meaning or content of a text. Things like determiners, pronouns, and prepositions are thus often excluded from analyses. When working with text mining applications, you often see comments like the following, from this blog post (emphasis mine):
Cleaning natural language is like panning for gold: most of language is useless, but every once in a while we find a gold nugget. We want to get only the nuggets.
From a linguist’s perspective, of course, the idea that most of language is useless is ridiculous. The simplistic assumption that we can get an accurate view (or even a glimpse) of the semantic content of a text by ignoring function words is easy to pick apart, but it is often seductive to non-linguists. We’ll come back to this later. Still, even if we acknowledge that function words are important, it can be useful at times to set them aside.
The stop_words dataset in the {tidytext} package contains stopwords from three different lexicons: SMART, snowball, and onix (see ?stop_words for links to these lists and sources). We can use them all together, or use a subset of stop words if that is more appropriate for a certain analysis. Here I’m going to use the snowball set, which is the smallest list, made up of personal pronouns, determiners, conjunctions, prepositions, negative markers, a few intensifiers, and modal, copular, and other auxiliary verbs.
stop_words %>%
dplyr::filter(lexicon == "snowball")
In addition to these, I’m going to add a few words that come up often in the song lyrics, e.g. oh, la, yeah, and hey. Once we have our list of stopwords, we can filter them out and then count up our content words.
jg_stopwords <- data.frame(
word = c("oh", "ooh", "la", "hey", "na", "ya", "yeah", "smiths", "https", "genius.com"),
lexicon = "JG",
stringsAsFactors = FALSE
)
snowball_stopwords <- stop_words %>%
dplyr::filter(lexicon == "snowball") %>%
bind_rows(jg_stopwords)
Now we use the anti_join() function to remove the stopwords from our lyrics dataframe.
tidy_smiths_content <- tidy_smiths_lyrics %>%
anti_join(snowball_stopwords) %>%
dplyr::filter(!str_detect(word, "\\d+")) # remove numbers
tidy_smiths_content %>%
count(word, sort = T)
We can plot a wordcloud of the most frequent words in Morrissey’s lyrics.
tidy_smiths_content_summary <- tidy_smiths_content %>%
count(word, sort = T)
wordcloud(
tidy_smiths_content_summary$word,
tidy_smiths_content_summary$n,
random.order = FALSE,
max.words = 60,
colors = pal)
Repetitiveness and lexical richness
Beyond simple word token and type counts, we might ask whether the longest songs are simply the most repetitive, or just longer songs that contain more repetitions of the refrain or chorus. What happens when we compare the type-token ratio (TTR) per song? TTR is a common measure of lexical density or richness, where a higher TTR reflects a lexically denser text, and a lower TTR a relatively repetitive one.
smiths_lyrics_summary %>%
ggplot(aes(reorder(song_name, ttr), ttr)) +
geom_col(fill = "#099087", width = .6) +
coord_flip() +
labs(x = "", y = "Type-token ratio * 100") +
ggtitle("Type-token ratio of The Smiths' songs") +
theme_minimal()
A problem with this method is that TTR is well known to be correlated with text length: as the total number of words increases, the TTR decreases asymptotically (see e.g. Tweedie & Baayen 1998). We can’t see the full asymptotic curve with only 75 songs, but there is a clear negative correlation.
smiths_lyrics_summary %>%
ggplot(aes(n, ttr)) +
geom_point(col = "#099087") +
geom_smooth(method = "lm", col = "black") +
labs(x = "Number of tokens", y = "Type-token ratio * 100") +
ggtitle("Number of tokens by TTR for The Smiths' songs") +
theme_minimal()
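We can see this length effect more directly with a quick simulation in base R (the “words” are synthetic): sampling longer and longer texts from a fixed vocabulary of 200 word types, the TTR falls steadily even though nothing about the underlying “language” changes.

```r
set.seed(1)
vocab <- paste0("w", 1:200) # a fixed vocabulary of 200 word types

# TTR (x 100) for a random text of n_tokens words drawn from the vocabulary
ttr_at <- function(n_tokens) {
  tokens <- sample(vocab, n_tokens, replace = TRUE)
  100 * length(unique(tokens)) / n_tokens
}

# TTR drops steadily as the text gets longer
round(sapply(c(50, 200, 1000, 5000), ttr_at), 1)
```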
Anyway, there is a whole lot you can do with lyrics once you have them, e.g. sentiment analysis, topic modelling, and so on.
References
Tweedie, Fiona J. & Harald Baayen. 1998. How variable may a constant be? Measures of lexical richness in perspective. Computers and the Humanities 32(5). 323–352.